(57) ABSTRACT

A framework is provided for describing multimedia content
and a system in which a plurality of multimedia storage
devices employing the content description methods of the
present invention can interoperate. In accordance with one
form of the present invention, the content description
framework is a description scheme (DS) for describing streams or
aggregations of multimedia objects, which may comprise
audio, images, video, text, time series, and various other
modalities. This description scheme can accommodate an
essentially limitless number of descriptors in terms of
features, semantics or metadata, and facilitate content-based
search, index, and retrieval, among other capabilities, for
both streamed and aggregated multimedia objects.
MULTIMEDIA CONTENT DESCRIPTION
FRAMEWORK 

This application claims priority to U.S. Provisional
Application Serial No. 60/110,902, filed on Dec. 4, 1998.

STATEMENT OF GOVERNMENT RIGHTS 

This invention was made with Government support under
grants NCC5-101 and NCC5-305 awarded by the National
Aeronautics and Space Administration (NASA). The
Government has certain rights in the invention.

BACKGROUND OF THE INVENTION 

1. Field of the Invention

The method and apparatus of the present invention relate
generally to multimedia content description, and more
specifically relate to a system for describing streams or
aggregations of multimedia objects.

2. Description of the Prior Art 

The number of multimedia databases and other archives
or storage means, as well as the number of multimedia
applications, has increased rapidly in the recent past. This
is due, at least in part, to the rapid proliferation of
digitalization of images, video, and audio and, perhaps most
importantly, to the availability of the Internet as a medium
for accessing and exchanging this content in a relatively
inexpensive fashion.

It is becoming increasingly important for multimedia
databases, multimedia content archives, Internet content
sites and the like to provide interoperable capabilities for
functions such as query, retrieval, browsing, and
filtering of multimedia content. There are many new
applications waiting to emerge when these multimedia storage
means having multiple modalities are made available online
for interaction with these applications. Some examples of
multimedia applications that may benefit from such
interoperability include:

On-demand streaming audio-visual: In addition to
video-on-demand type capabilities, there is a need to be able
to browse and access audio-visual data based on the
parametric values as well as the content.
Universal access: Due to the rapid advance of pervasive
computing devices, Internet appliances, eBook and the
like, there is a growing need for automatic adaptation
of multimedia content for use on a wide variety of
devices based on a combination of client device
capabilities, user preferences, network conditions,
authoring policies, etc.
Environmental epidemiology: Retrieve the location(s) of
houses which are vulnerable to epidemic diseases, such
as Hantavirus and Dengue fever, based on a combination
of environmental factors (e.g., isolated houses that are
near bushes or wetlands) and weather patterns (e.g., a
wet summer followed by a dry summer).
Precision farming: (1) Retrieve locations of cauliflower
crop developments that are exposed to clubroot, which
is a soil-borne disease that infects cauliflower crops.
Cauliflower and clubroot are recognized by spectral
signature, and exposure results from their spatial and
temporal proximity; (2) Retrieve those fields which
have abnormal irrigation; (3) Retrieve those regions
which have higher than normal soil temperature.
Precision forestry: (1) Calculate areas of forests that have
been damaged by hurricane, fire, or other natural
phenomena; (2) Estimate the amount of the yield of a
particular forest.
Petroleum exploration: Retrieve those regions which
exemplify specific characteristics in the collection of
seismic data, core images, and other sensory data.

Insurance: (1) Retrieve those regions which may require
immediate attention due to natural disasters such as
earthquake, fire, hurricane, and tornadoes; (2) Retrieve
those regions having higher than normal claim rate (or
amount) that are correlated to the geography (close to
coastal regions, close to mountains, in high crime rate
regions, etc.).

Medical image diagnosis: Retrieve all MRI images of 
brains having tumors located within the hypothalamus. 
The tumors are characterized by shape and texture, and 
the hypothalamus is characterized by shape and spatial 
location within the brain. 
Real estate marketing: Retrieve all houses that are near a 
lake (color and texture), have a wooded yard (texture) 
and are within 100 miles of skiing (mountains are also 
given by texture). 
Interior design: Retrieve all images of patterned carpets 
which consist of a specific spatial arrangement of color 
and texture primitives. 
Due to the vast and continuous growth of multimedia
information archives, it has become increasingly
difficult to search for specific information. This difficulty is
due, at least in part, to a lack of tools to support targeted
exploration of audio-visual archives and the absence of a
standard method of describing legacy and proprietary
holdings. Furthermore, as users' expectations of applications
continue to grow in sophistication, the conventional notion
of viewing audio-visual data as simply audio, video, or
images is changing. The emerging requirement is to
integrate multiple modalities into a single presentation where
independently coded objects are combined in time and
space.

Standards currently exist for describing domain-specific
applications. For example, Z39.50 has been widely used for
library applications; EDI (Electronic Data Interchange) has
been widely used for supply chain integration and virtual
private networks. However, both of these standards are
essentially adapted for text and/or numeric information.
Open GIS (geographical information system) is a standard
for providing transparent access to heterogeneous
geographical information, remotely sensed data and
geoprocessing resources in a networked environment, but it only
addresses the metadata. Open GIS has no provisions for
storing features and indices associated with features. SMIL
(Synchronized Multimedia Integration Language) is a W3C
recommended international standard which was developed
primarily to respond to that requirement, and the MPEG-4
standardization effort is presently under development to
address the same issue. The existence of multiple standards
and/or proposals relating to the exchange of various types of
information only reinforces the recognition of the need to
have a uniform content description framework.

Despite the latest efforts, however, there remains a need, 
in the field of multimedia content description, for solving a 
number of outstanding problems, including: 

the lack of a unified means for describing the multiple
modalities/multiple fidelities nature of multimedia
content;

the lack of a unified means for describing both spatial and 
temporal characteristics among multiple objects; and 

the lack of a means for describing both streams and 
aggregations of multimedia objects. 
OBJECTS AND SUMMARY OF THE INVENTION

It is an object of the present invention to provide a
multimedia content description system comprising a unified
framework which describes the multiple modalities/multiple
fidelities nature of many multimedia objects, including
metadata description of the spatial and temporal behavior of
the object through space and/or time.

It is another object of the present invention to provide a
multimedia content description system comprising a unified
framework which describes both the spatial and spatiotemporal
nature among multiple objects.

It is yet another object of the present invention to provide
a multimedia content description system for describing both
streams and aggregations of multimedia objects.

It is a further object of the present invention to provide a
system comprising information archives employing
interoperable capabilities for such functions as query, retrieval,
browsing and filtering of multimedia content.

The present invention revolutionizes the access and
exchange of varying types/formats of multimedia information
between client devices and multimedia storage devices
by providing a framework for describing multimedia content
and a system in which a plurality of multimedia storage
devices employing the content description methods of the
present invention can interoperate. In accordance with one
form of the present invention, the content description
framework is a description scheme (DS) for describing streams or
aggregations of multimedia objects, which may comprise
audio, images, video, text, time series, and various other
modalities. This description scheme can accommodate an
essentially limitless number of descriptors in terms of
features, semantics or metadata, and facilitate content-based
search, index, and retrieval, among other capabilities, for
both streamed and aggregated multimedia objects.

The description scheme, in accordance with a preferred
embodiment of the present invention, distinguishes between
two types of multimedia objects, namely, elementary objects
(i.e., terminal objects) and composite objects (i.e.,
non-terminal objects). Terminal objects are preferably described
through an InfoPyramid model to capture the multiple
modalities and multiple fidelity nature of the objects. In
addition, this representation also captures features,
semantics, spatial, temporal, and differing languages as
different modalities. Non-terminal objects may include, for
example, multiple terminal objects with spatial, temporal, or
Boolean relationships, and thus allow the description of
spatial layout and temporal relationship between various
presentation objects, the appearance, disappearance, forking
and merging of objects, etc.

Both terminal and non-terminal objects preferably form
the basis for describing streams or aggregations of
multimedia objects. In principle, a stream may consist of one or
more terminal or non-terminal objects with layout and
timing specifications. Consequently, a stream description is
preferably defined as a mapping of a collection of
inter-object and intra-object description schemes into a serial bit
stream. An aggregation, in contrast, preferably consists of a
data model/schema, occurrences of the objects, indices, and
services that will be provided. Both streaming and
aggregation are described within the current framework.

These and other objects, features and advantages of the
present invention will become apparent from the following
detailed description of illustrative embodiments thereof,
which is to be read in connection with the accompanying
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a conceptual view of a conventional content
description system, illustrating an example of a client
connected to a plurality of information archives through a
network.

FIG. 2 is a conceptual view of a conventional content
description system illustrating different types of client
devices connected, through a network, to a collection of
different information content sources.

FIG. 3 is a block diagram illustrating a preferred data
model or description scheme (DS) for the multimedia
content description framework of the present invention, which
includes at least one InfoPyramid and one inter-object
description model.

FIG. 4 is a graphical representation illustrating a basic
InfoPyramid data model for representing multimedia
information, formed in accordance with one embodiment of
the present invention.

FIG. 5 is a logical flow diagram illustrating an example of
Inter Object Specification of four objects with spatial
relationships, formed in accordance with the present
invention.

FIG. 6 is a logical flow diagram showing an example of
Inter Object Specification of four objects with temporal
relationships, formed in accordance with the present
invention.

FIG. 7 is a block diagram illustrating an example of Inter
Object Specification, formed in accordance with the present
invention, in which data objects are merged and split.

FIG. 8 is a block diagram illustrating a preferred data
model or description scheme (DS) for the InfoPyramid,
formed in accordance with the present invention.

FIG. 9 is a graphical representation of the InfoPyramid of
FIG. 3, illustrating modality translations and fidelity
summarization within the InfoPyramid framework, according
to the present invention.

FIG. 10 is a block diagram depicting a data model or
description scheme (DS) for an InfoPyramid object
description, in accordance with a preferred embodiment of
the present invention.

FIG. 11 is a flow diagram illustrating an example of a
modality dependency graph of a video clip, in accordance
with one form of the present invention.

FIG. 12 is a flow diagram illustrating an example of a
modality dependency graph of an image, in accordance with
one form of the present invention.

FIG. 13 is a flow diagram illustrating an example of a
modality dependency graph of a speech clip, in accordance
with one form of the present invention.

FIG. 14 is a flow diagram illustrating an example of a
modality dependency graph of a text document, in
accordance with one form of the present invention.

FIG. 15 is a block diagram depicting a full data model of
a multi-modal InfoPyramid, which may include one or more
InfoPyramid association objects, each object comprising one
or more modalities.

FIG. 16 is a graphical view illustrating an example of
possible predetermined modality and fidelity associations
for various known devices or device categories, in
accordance with one form of the present invention.

FIG. 17 is a flow diagram illustrating a preferred method
of the present invention for describing multimedia content
from a multimedia content source, including recursively
transforming the multimedia content according to the
InfoPyramid data model.
FIG. 18 is a flow diagram illustrating a preferred method
of the present invention for synthesizing multimedia
content, including combining multimedia content
components from target multimedia devices into a composite
multimedia object.

FIG. 19 is a flow diagram illustrating an example wherein
a search engine transforms a user query into a plurality of
different queries, each query satisfying the constraints of a
corresponding multimedia source, in accordance with one
form of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 depicts a conceptual view of a conventional
content server environment including one or more
information archives or servers 102 and one or more clients 103. The
information archives 102 and the clients 103 are typically
interconnected through a network 101 for accessing and
exchanging data therebetween.

In the content server system of FIG. 1, a client 103
generally initiates a specific request or query. The request or
query is then sent to one or more information archives 102.
The decision as to which of the one or more archives 102
will receive the client request or query is determined by the
client 103 or, in some cases, it may be determined by the
network 101. The archive(s) 102 receiving the client request
then acts on the request, and may potentially issue its own
request to one or more other archives 102. After the client
request or query has been serviced, the results are
summarized (e.g., by the appropriate servicing archive) and routed
to the client 103. Likewise, the client 103 may receive
results from multiple archives 102 and summarize the results
locally. In the conventional content server environment, it is
essential that interoperability exists between the client 103
and an archive 102, as well as between the archives 102
themselves (assuming more than one archive is employed)
so that the metadata (i.e., data that describes what is
contained in each archive) can be interpreted by all components
of the system. Furthermore, interoperability is essential on
features (in the image and video archive case), feature
indices (if there is a need to access high-dimensional feature
space) and semantics.

FIG. 2 depicts another conventional content server
example, in which the content exchange system contains one
or more content server/archives 202 and a variety of client
devices, for example, a general purpose computer 205, a
laptop computer (not shown) with potentially different
bandwidth load, a portable computing system (PCS) 206 and a
personal digital assistant (PDA) 207. These devices are
interconnected through a network 201, as similarly shown in
FIG. 1. In order for the same content to be displayed on
different platforms, the content format must first be adapted
or converted. This adaptation is necessary since the display
and processing capabilities of the various client devices
employed with the content server system may differ widely.
Content adaptation may take place on the content server 202
prior to being transmitted to the appropriate client device.
Similarly, content adaptation may be performed by a proxy
203. Furthermore, content adaptation may be performed by
a client content adaptation/filter 204, and the modified
content subsequently transmitted to the proper client
device(s).

In accordance with one embodiment of the present
invention, a multimedia content description framework
(MMCDF) is provided that solves the problem of describing
various types of multimedia information in digital form in
terms of either streams or aggregations. With reference to
FIG. 3, the MMCDF 301 preferably distinguishes between
terminal and nonterminal objects. A terminal object is
preferably defined as an elementary object or relationship, which
may correspond to a real world object, concept,
phenomenon, or other representation. A composite (or
nonterminal) object preferably uses one or more terminal
objects as building blocks to describe and define more
complex objects, relationships, or representations.
Furthermore, nonterminal objects preferably use additional
spatial, temporal, Boolean rules, or the like to capture the
spatial, temporal, Boolean, or other relationships between
multiple terminal or nonterminal objects, or any conceivable
combination of terminal and nonterminal objects.

Differing information content, and their representative
data formats (e.g., video, images, audio, text, etc.), captured
during the same event, or otherwise relating to the same
object or expression, may preferably all belong to the same
terminal object. It should be appreciated that a given
multimedia content source includes one or more terminal
objects. By way of example only, consider the following
events and corresponding information content:
computerized tomography (CT or CAT), magnetic resonance imaging
(MRI), ultrasound, and digital x-ray data taken from the
same patient during a particular examination may all belong
to the same terminal object; satellite images of a particular
location captured by Landsat Thematic Mappers (TM)/
Multispectral Scanners (MSS) (both of which are image
acquisition modules on the Landsat platform), Advanced
Very High Resolution Radiometer (AVHRR) data, a map
relating to the same area, digital elevation map (DEM), and
measurement data from weather stations may constitute a
terminal object; core images, seismic survey data and
sensory data (such as the Formation Micro Imager (FMI)
and other logging data) collected during the drilling of a
borehole for gas/oil exploration may belong to the same
terminal object.

Preferably, a representation for a terminal object captures
substantially all possible modalities (e.g., features,
characteristics, semantics, metadata, etc.) that may arise in
different applications or events. With continued reference to
FIG. 3, in accordance with the present invention, a
representation model or description scheme (DS), defined herein
as an InfoPyramid 302, is provided for describing:

The data model used in this terminal object (this can be
described, for example, by an XML/RDF schema);

Individual modalities, such as images, video, audio, text
(potentially in different languages), metadata, features,
etc., with each modality comprising one or more
fidelities (e.g., resolution, quality, color, etc.); and

Additional modalities, such as the spatial characteristics
(e.g., the location/position in (x,y) or (latitude,
longitude)) and spatio-temporal behavior (e.g., the
trajectory) of the object.

Preferably, the multimedia content description framework
(MMCDF) of the present invention further provides an Inter
Object Specification 303 (IOS) framework or description
scheme to describe both spatial and temporal relationships
among multimedia objects, as well as to specify inter-object
user interactions. This framework 303 allows the
specification of the semantics and syntax for combining media
objects into composite objects (or nonterminal objects). A
detailed discussion of the IOS 303 and InfoPyramid 302
representations is provided herein below with reference to
FIGS. 3 and 4.
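By way of illustration only, the distinction between terminal and nonterminal objects may be sketched as a small recursive data structure. The sketch below is written in Python and is not part of the described embodiments; the class names, field names, and the tuple layout used for relationships are assumptions introduced solely for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class TerminalObject:
    """Elementary object (hypothetical structure, for illustration only)."""
    name: str
    # modality name -> fidelity labels, ordered from highest to lowest fidelity
    modalities: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class CompositeObject:
    """Nonterminal object built from terminal and/or nonterminal children."""
    name: str
    children: List["MediaObject"] = field(default_factory=list)
    # Each relation is (kind, subject, predicate, object); kind is assumed to be
    # one of "spatial", "temporal", or "boolean".
    relations: List[Tuple[str, str, str, str]] = field(default_factory=list)


MediaObject = Union[TerminalObject, CompositeObject]


def terminal_names(obj: MediaObject) -> List[str]:
    """Recursively collect the names of all terminal objects under obj."""
    if isinstance(obj, TerminalObject):
        return [obj.name]
    names: List[str] = []
    for child in obj.children:
        names.extend(terminal_names(child))
    return names


# Data captured during the same examination belongs to one terminal object.
exam = TerminalObject("patient_exam", {"CT": ["full", "thumbnail"], "MRI": ["full"]})
report = TerminalObject("radiology_report", {"text": ["full", "summary"]})

# A composite object combines terminal objects and records their relationships.
case = CompositeObject(
    "case_file",
    children=[exam, report],
    relations=[("temporal", "patient_exam", "before", "radiology_report")],
)
print(terminal_names(case))
```

Because a composite object may itself contain composite objects, the same traversal applies at any depth of nesting.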
InfoPyramid for Intra-Object Specification

Multimedia content typically does not exist as a single,
homogeneous media format or modality. Consider, for
example, that a video clip may include raw data from a video
source, as well as audio (possibly in multiple languages) and
closed captions. As a further example, consider a medical
environment, wherein MRI, CT, PET, and ultrasound
information can be collected for the same patient, thus resulting
in multiple three-dimensional (3D) scans of the same or
similar content. Consequently, each terminal object in the
multimedia content description framework (MMCDF) of the
present invention is preferably defined by a data structure
InfoPyramid 302, a preferred embodiment of which is
detailed in FIG. 4. As shown in FIG. 4, the InfoPyramid
describes content in different modalities 402 (e.g., video
403, audio 405, text 406, etc.) and at different fidelities 401.
Preferably, the highest resolution/fidelity level is represented
along the base of the pyramid, with the lowest level of
resolution/quality being represented at the top of the
InfoPyramid model. Furthermore, the InfoPyramid of the present
invention preferably defines methods and/or criteria for
generating, manipulating, transcoding and otherwise
transforming the source multimedia content as desired, or as
suitable for a particular target platform, device, or class of
devices.

As mentioned above, in addition to being comprised of
multiple modalities, each content component may also be
described at multiple fidelities. Here, fidelity may refer not
to the format of the information, but rather to the appearance
or quality of the information. For example, fidelity may
include the resolution, number of colors, sharpness, frame
size, speed, etc. of video/image content; the audio quality
(sampling rate), pitch, etc. of audio content; or the language,
font style, color, size, summarization, etc. of text content, as
appreciated by those skilled in the art. Numerous resolution
reduction techniques are known by those skilled in the art for
constructing image and video pyramids. For example,
Flashpix is a commercially available application which provides
mechanisms for storage and retrieval of still images at
multiple resolutions. Likewise, features and semantics at
different resolutions are preferably obtained from raw data
or transformed data at different resolutions, thus resulting in
a feature or semantics pyramid.

Preferably, each device or class of devices can be
represented by a different InfoPyramid. Alternatively, an
InfoPyramid may be used to describe all of the modalities and
fidelities required by a particular multimedia system. As an
example, consider a personal digital assistant (PDA), wearable
on a user's wrist, which includes an LCD capable of
displaying only text or very low resolution images. In
accordance with the multimedia content description framework
(MMCDF) of the present invention, an InfoPyramid model
representing this PDA device may include only text and
image modalities and relatively few fidelity levels. An
InfoPyramid representation of the multimedia source would
then preferably be transformed into the InfoPyramid
representation of the target device, using known transformation
schemes, prior to displaying the multimedia information on
the target device. It is to be appreciated that the present
invention contemplates that such content transformation
may take place either at the multimedia source, at the target
device, or at any suitable point therebetween.

Occasionally, an appropriate multimedia content modality
may not exist to appropriately describe the multimedia
content. In some cases, the required modality may be
synthesized by transforming and/or combining other
existing modalities and/or fidelities until the desired content is
adequately described. For example, a video clip may be
transformed into images showing key frames. Likewise, text
can be synthesized into speech, and vice versa. Furthermore,
since the content description framework of the present
invention is recursive, multiple transformations may be
performed within the same non-terminal object, either
between two or more different modalities, or between two or
more different fidelities, or a combination thereof.

With reference to the InfoPyramid example of FIG. 4,
possible modalities 402 may include, but are not limited to,
text 406, images 404, audio 405 and video 403. A preferred
embodiment of the InfoPyramid data model is shown in FIG.
8, and preferably includes two broad categories of data,
namely, non-structured data and semi-structured data, as
discussed in more detail herein below. It should be
appreciated, however, that additional categories or types of
data may be represented by the content description
framework of the present invention in a similar manner.

Non-structured and Semi-structured Data
Text description scheme (DS) 807: This modality is
preferably the free text or structured text description
(using HTML or XML, for example) within an object.
Note that an object may contain text in different
languages, and each language will preferably constitute
a different modality.
Image DS 804: Images are generally RGB or
multispectral, such as those acquired from satellite
images, although virtually any image format may be
described according to the present invention. There
may exist multiple image modalities, depending on the
application. Images may be stored as raw data or in a
transformed format (e.g., blocked DCT as used in
JPEG).

Audio DS 805: This modality preferably captures audio 
information, including speech, natural sounds, music 
and the like. 

Video DS 803: This modality preferably captures video
and visual information.

Feature Descriptors: These modalities preferably include 
textures, color histograms, shapes, from both still 
images and video, as well as motion derived from 
video. Note that features can be derived from either the 
raw data or the transformed data. 

Semantics and object DS 808: Typical semantics and
object descriptions may include, for example,
houses/trees from a still image, an anchorwoman from news
video clips, and forest from satellite images. These
semantics and object descriptions can be either
automatically or manually derived from features, raw data,
transformed data, or the like.

Annotations, metadata, and additional DS 809: These 
modalities preferably provide global descriptions of the 
content, including, for example, the author/publisher, 
date, location of event, etc. 

Spatial-temporal behavior DS 802: This modality
preferably describes spatial characteristics of an object or
event as a function of some other measurable quantity
or relationship, such as time or distance (e.g., size as a
function of time, or location as a function of time).

Temporal behavior DS 802: This modality preferably 
describes the temporal behavior of non-spatial 
characteristics, such as, for example, intensity as a 
function of time. 
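By way of illustration only, an InfoPyramid holding several of the modalities listed above at multiple fidelities, and its transformation for a limited target device such as the wrist-worn PDA discussed earlier, may be sketched as follows. The sketch is written in Python and is not part of the described embodiments. The class and function names, the convention that fidelity level 0 denotes the highest fidelity (the base of the pyramid), the device profile format, and the stand-in key-frame translator are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class InfoPyramid:
    # modality -> list of (fidelity_level, content), sorted by level.
    # Assumption: level 0 is the highest fidelity (base of the pyramid).
    cells: Dict[str, List[Tuple[int, str]]] = field(default_factory=dict)

    def add(self, modality: str, level: int, content: str) -> None:
        self.cells.setdefault(modality, []).append((level, content))
        self.cells[modality].sort(key=lambda cell: cell[0])

    def select(self, modality: str, min_level: int) -> Optional[str]:
        """Return the best content whose level is at least min_level, if any."""
        for level, content in self.cells.get(modality, []):
            if level >= min_level:
                return content
        return None


# (source modality, target modality) -> translation function (hypothetical)
Translators = Dict[Tuple[str, str], Callable[[str], str]]


def adapt(source: InfoPyramid, profile: Dict[str, int],
          translators: Translators) -> InfoPyramid:
    """Build a device-specific pyramid from a source pyramid.

    profile maps each modality the device supports to the finest
    fidelity level it can present.
    """
    target = InfoPyramid()
    for modality, min_level in profile.items():
        content = source.select(modality, min_level)
        if content is None:
            # Required modality is missing: try to synthesize it from another one.
            for (src, dst), translate in translators.items():
                base = source.select(src, 0)
                if dst == modality and base is not None:
                    content = translate(base)
                    break
        if content is not None:
            target.add(modality, min_level, content)
    return target


source = InfoPyramid()
source.add("video", 0, "video:news_clip@full")
source.add("text", 0, "Full transcript of the clip")
source.add("text", 1, "Short summary")

# A wrist-worn PDA that shows only summarized text and very low resolution images.
pda_profile = {"text": 1, "image": 2}
# Stand-in for key-frame extraction; a real translator would process video data.
translators = {("video", "image"): lambda v: f"keyframe-of({v})"}

target = adapt(source, pda_profile, translators)
print(target.cells)
```

In this sketch the text modality is served from an existing lower-fidelity entry, while the image modality, which the source does not contain, is synthesized from the video modality.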
Also shown in FIG. 8 are InfoPyramid DS 801 and
VRML DS 806.

Structured Data

Structured data preferably describes the
structure/organization of an object or other entity and may include
generic tables, relational tables, and spreadsheets (e.g.,
Excel, Lotus 1-2-3, etc.), among other suitable
representations. For example, the structured data description may
describe the structure of a table, such as the attributes of the
columns, and the data distribution of each column.

It should be appreciated that in accordance with the
present invention, additional modalities may be added as
necessary. For example, it may be desirable to describe an
object in both the English and French languages, each of
which may occupy a modality. Some of the modalities are
more suitable for indexing purposes rather than for browsing
purposes, such as a features modality. This preferred
decomposition of multimedia content description allows better
flexibility of index and retrieval of object(s) and composite
objects at multiple abstraction levels. Each of these
modalities can be represented at multiple resolutions or fidelity
levels. While the concept of multiple resolutions for images
and video based on various pyramidal coding is well known
by those skilled in the art, this concept has not been applied
to text and other modalities.

In accordance with a preferred embodiment of the present
invention, the spatial or temporal location of a particular
feature or semantics preferably inherits the location of the
data, transformed data, or features that this feature or
semantics is derived from. Neighboring features (both spatially
and temporally) that are "similar" are preferably grouped
together to form a region. A region can be defined by its
minimum bounding box, a polygon, a quadtree, or any other
suitable representation known in the art. The location of this
representation is preferably inherited from the features
before grouping. Similarly, neighboring semantics (both
spatially and temporally) that are "similar" are preferably
grouped together to form a region. Similar techniques used
to represent features can be adopted here to represent the
location of an object, event, etc.

In one multimedia application relating to the Moving
Picture Experts Group (MPEG) data compression standard,

With reference to FIG. 9, for each modality of the
multimedia content, the original modality at the original
resolution is preferably selected as a root entity in the
description. When either a different modality or a different
fidelity/resolution is derived from the existing modality
and/or resolution, the new modality and/or resolution is
defined as the child of the original modality or resolution
(with the original modality or fidelity preferably being
referred to as the parent) and is preferably recorded on the
modality dependency entity description.

Preferably, each connection between adjacent nodes (i.e.,
modalities and/or fidelities) corresponds to a transformation,
either between two different modalities (i.e., a horizontal
transformation), or between two different fidelities (i.e.,
a vertical transformation). It should be appreciated that
diagonal transformations may be derived from a combination of
horizontal or vertical transformations using the InfoPyramid
model. A new entity may preferably be derived using one or
more of these transformations according to predetermined
rules corresponding to each connection on the modality
dependency entity description. Alternatively, the present
invention similarly contemplates the use of diagonal
transformations (i.e., a transformation between two modalities
and two fidelities), or any other conceivable
translation/transformation known by those skilled in the art. The
method and/or rule that is used to derive the new entity is
also preferably recorded as part of the relationship. This
process is preferably repeated for every original modality
and/or fidelity in the multimedia content.

As discussed above, there are essentially two types of
method/rules which may be utilized by the present invention
to derive a new entity:

Modality translation (e.g., 901, 902, 903): This
transformation includes, for example, text to audio, video to
images (using keyframe extraction), image to text
(using image recognition techniques or annotations by
a human being), etc.; and

Fidelity transformation (e.g., 904-909): This
transformation includes, for example, lossy compression of
images, audio and video, rate reduction, color depth

for example, a main difference between the InfoPyramid reduction, and text summarization. Techniques are also 

approach of the present invention and other conventional known by those skilled in the art to summarize videos 

schemes proposed for MPEG-7, is the virtually complete through the extraction of a storyboard, or the construc- 

climination of the dichotomy between data and metadata. In tion of a scene transition diagram, 

practice, it has become harder to distinguish between trans- 45 It should be appreciated that the example transformations 

formed data and features. For instance, wavelet coefficients, described above are furnished only to illustrate preferred 

such as those based on quadrature mirror filters and Gabor modality-fidelity transformation rules, and are not intended 

filters, have been used for both transformations as well as lo limit the scope or application of the present invention, 

feature extractions. Consequently, a data model that cao Other transformation schemes may also be employed with 

accommodate both data and metadata in a seamless fashion 50 the present invention. 

is extremely desirable. InfoPyramid accommodates both raw By way of example only, FIG. 11 illustrates one method, 

data and transformed data as one of the modalities, thus wherein a video modality can be converted into Key Frame 

eliminating possible asymmetry introduced by restricting or Story Board modality, which can then be converted to a 

the data model to only metadata. text modality, and possibly converted to a speech modality. 

One of the primary challenges faced by a content descrip- SS For each of these modalities, different levels of detail (or 

tion framework that is capable of processing multimedia fidelity) can be represented. Many of these conversions are 

information comprised of various modalities and/or fideli- preferably done automatically, and thus the «generate» 

tics is the synchronization of feature and semantics descrip- action in the dependency graph will preferably capture and 

tions among the different modalities and/or different fideli- record the method that is used to transform from one 

ties. These challenges have been solved by the present 60 modality to another (or one fidelity to another). These 

invention which, in a preferred embodiment, provides, conversions may also be done manually, in which case the 

among other things, a modality dependency entity- «generate» phrase preferably only illustrates the depen- 

relationship, preferably generated for each of the original dency of the data entities. Note that both key frame and text 

modalities and resolutions/fidelities (except metadata), and can be converted to graphics. Converting images to graphics 

stored as part of the description scheme. An example of a 65 typically involves the extraction of edges from the images, 

modality dependency entity description is illustrated in FIG. By using an iconic vocabulary, it is also possible to convert 

9. text, text phrases, text sentences and the like to graphics. 
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FIG. 12 illustrates an example of multimedia content level and an audio modality at a slightly higher fidelity. 

wherein an original modality is an image. Text descriptions Similarly, a presentation for a Psion™ device 1602, may 

can be generated, in accordance with the present invention, include text, image, and audio modalities at even higher 

for the images in terms of the objects inside the image as fidelities; A presentation for color WinCE devices 1603, may 

well as the metadata of the image. These text descriptions S include just text and image modalities at the same fidelity as 

can then be converted to speech using transformation rules for a Psion™ device. Accordingly, a full presentation 1604 

known in the art. Both the image and text representations [°[ a general purpose personal computer (PC) connected at 

can be converted lo graphics, which captures the "essence" nct ™ rk bandwidth, for example wfll preferably include 

of the image, again using any known transformation tech- J? f &>**' image ' ^V nd T^i 11 ,hC 

oiques. For each of these derived modalities, multiple fideli- ,0 ^, hcSt fidcUtlCS SU PP ortcd b * thc ™ ltimcdia »»" 

. . t tent source. 

| r .« ..... , . Since it is virtually impossible to predict new types/ 

FIG. 13 illustrates an example of the multimedia content formats of data Qr ^ amaUss a lications ^ such daU 

wherem an ongmal modahty is speech. Speech can be which may arise> m ^pca of the present inven- 

converted to text through speech recognition actions, as Uon ^ it5 ability to dynamically cxtcn d the multimedia 

appreciated by those skilled in the art. The text can then be 15 content description framework to accommodate such new 

converted to graphics (such as animated cartoons, etc.) using content as it is encountered. In accordance with the present 

known transformation rules/methods. As in any of the invention, content descriptions are preferably defined from 

modality transformation examples, each modality can an extensible set of description data types. The descriptor 

include multiple fidelities. schemes are preferably defined by specifying sets of content 

FIG. 14 illustrates an example of multimedia content 20 descriptors and methods to facilitate indexing, searching and 

wherein thc original modality is text. Text can be converted comparing instances of the content. A more detailed discus- 

to speech through text-to-speech synthesis, among other sion of description data types is provided herein below. 

suitable methods known to those skilled in thc art. Text can Description Data Types 

also be converted to graphics (e.g., cartoon). Similar to In accordance with the present invention, the multimedia 

previous examples, each of the modalities can also include 25 content descriptors arc preferably defined from fundamental 

multiple fidelities. description data types and/or user-derived data types. The 

With reference now to FIG. 15, a preferred data model of multimedia content description system of the present inven- 

a multimodal InfoPyramid, in accordance with the present tion preferably provides a fundamental set of primitives and 

invention, is depicted. Preferably, in addition to the modality extended data types. In applications, thc content descriptors 

dependency entity relationship, a modality association 30 are defined as instances of these fundamental data types, 

entity-relationship 1502, 1507, 1508 exists in the InfoPyra- Furthermore, the multimedia content description system of 

mid data model to describe the association of a subset of the 0ie present invention preferably provides the mechanism for 

modalities at a given resolution (or fidelity). Each associa- the user lo derive new data types by extending the funda- 

tion is preferably a collection of the modalities that are mental data types. 

individually suitable to be presented (i.e., displayed) on a 35 As an example, a preferred fundamental set of primitives 

target platform (e.g., for a given device or range of devices, may include binary, integer, real, element, set, relationship, 

a given bandwidth or range of bandwidths, a given user etc. Similarly, an example of extended description data types 

interest or range of user interests). The baseline association of thc multimedia content description system may include 

1509 constitutes the original collection of modalities for the array, vector, histogram, point, line, path, rectangle, 

multimedia document. These modalities can include, for *o polygon, shape, sequence, character, term, text, concept, 

example, as shown in FIG. 15; Video-i 1503, Audio-i 1504, composite, dictionary, thesaurus, ontology, and others as 

Image-i 1505, and Text-i 1506. Also shown in FIG. 15 is may be defined by the user. 

Info-Pyramid DS 1501. Additional associations may repre- In creating content description instances, many of the data 

sent collections of modalities that can be progressively types, T, preferably utilize modifiers, for example, of the 

retrieved, or that arc suitable for presentation when thc 45 formT(t)[l], which may specify that T contains 1 element 

bandwidth is insufficient, or when the platform imposes of type l. As another example, in accordance with the present 

severe constraints on multimedia content presentation (e.g., invention, an n-dimensional vector in integer space may be 

Palm, PDA devices, screen phones, etc.). Contextual infer- represented as vector(integer)[n]. Each description data type 

mation for each association is preferably recorded in thc preferably contains methods to construct, compare, and 

modality association relationship. so destroy description objects of that data type. 

It is to be appreciated that object descriptions from Derived data types, D, are preferably derived from fun- 
different modalities can be associated, and the descriptions damenlal data types. For example, a derived type D which 
of these associations similarly stored in the InfoPyramid is derived from type T may preferably be defined as D:[T]. 
association DS. As an example, consider the case where In general, derived data types are often most useful when 
object 1 in an image description refers to object 2 in a video 55 combining fundamental data types as, for example, D:[T1, 
description, and object 3 in a text description occurs before T2, T3, . . . \ where D is the data type derived from 
object 4 in a video description. These associations can assist fundamental data types Tl, T2, T3, etc. For example, 
the transcoding process when the right set of modalities consider the following definition of the derived data type 
must be selected. "deformation": 

In FIG. 16, examples of various predetermined 60 dcformation:[sequence(shapc)[N],path] 

associations, as represented on the InfoPyramid model of the In this definition, the derived data type deformation, is 

present invention, arc shown for some popular multimedia preferably derived from a sequence of shape transformations 

platforms or devices. It should be appreciated, however, that specified by "sequence(shape)[N]", as well as a translation 

thc present invention is not limited to those precise embodi- through a path specified by "path", 

ments shown in FIG. 16. With reference to FIG. 16, a 65 Standard Descriptors 

presentation for a Motorala StarTac™ device 1601, for In accordance with a preferred embodiment of the present 

example, may include a text modality at the lowest fidelity invention, a set of standard descriptors have been developed 




for images and videos across several search and retrieval
applications. These descriptors preferably describe various
visual features, such as color, texture, motion, etc. For
example, 166-bin color histograms derived from HVS color
space may preferably be defined as:

HVShist:histogram(real)[166]

Two descriptors are preferably utilized for texture. By way
of example only, consider the descriptors

QMFtexture:vector(real)[9]; and

conglomtexture:vector(real)[20],

where QMFtexture is preferably defined by the spatial-
frequency energies on 9 subbands of the QMF wavelet
transform of the image and conglomtexture is preferably
defined from a conglomeration of 20 texture features which
are suitable for querying-by-texture of satellite imagery.
Description Functions 

The content description system of the present invention 
preferably defines a fundamental set of description functions 
which operate on the description. At least one primary 
purpose of the description functions is to facilitate the 
comparison of description values, which enables searching, 
indexing, retrieval, among other contemplated functions, of 
the source multimedia content. 

The fundamental description functions preferably com- 
prise several classes, such as logic, similarity and transform 
functions, among others. Fundamental logic functions con- 
templated by the present invention may include, for 
example, "equals", "not equals", "greater-than", "less-than",
"and", "or", "not", etc. as known by those skilled in the art. 
The logic functions preferably perform binary operations, 
although the present invention similarly contemplates other 
suitable types of logical operations. The similarity functions, 
on the other hand, preferably return a score. Suitable fun- 
damental similarity functions for use with the present inven- 
tion include, for example, "walk", "Euclidean", 
"chessboard", "quadratic", "hamming", and others known 
by those skilled in the art, which define standard mathemati- 
cal formulas for computing distances. 
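
By way of illustration only, the similarity functions named above might be sketched as follows. This is a minimal sketch and not a definition taken from the present specification: reading "walk" as the city-block (L1) distance is an assumption, as is the use of a caller-supplied weight matrix `A` for the "quadratic" distance. All function names are hypothetical.

```python
import math


def walk(x, y):
    """City-block (L1) distance; reading "walk" this way is an assumption."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def chessboard(x, y):
    """Chessboard (L-infinity) distance: the largest per-element difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Hamming distance: the number of positions at which the values differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def quadratic(x, y, A):
    """Quadratic-form distance sqrt(d^T A d), with d = x - y.

    A is an assumed square weight matrix given as a list of rows.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return math.sqrt(sum(d[i] * A[i][j] * d[j] for i in range(n) for j in range(n)))
```

Each function returns a score in which a smaller value indicates greater similarity. With an identity matrix for `A`, the quadratic distance reduces to the Euclidean distance.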

As can be appreciated by those skilled in the art, trans- 
form functions essentially define operations on the descrip- 
tion which transform it in some way. For example, transform 
functions can define the relationship between one descrip-
tion type and another standard description type. For
example, consider the standard description type

rgbhist:histogram(integer)[512]

Given this descriptor, which preferably defines rgbhist as a
512-bin histogram in RGB color space, another derived
description type may be declared, such as

myhist:histogram(integer)[512],

which may define a color histogram in a different color
space. Assuming the new color space is derived from the
RGB color space, then myhist may be obtained via a
transformation, F, of rgbhist, which may be represented as

myhist=F(rgbhist).

One skilled in the art can appreciate the importance of the 
above transformations in conducting queries across multiple 
archives of multimedia content, as illustrated in FIG. 19. For 
example, as illustrated in FIG. 19, each archive 1903, 1904, 
1905 may utilize a different color histogram description. In 
order for the search engine to query the multiple archives 
given a single query color histogram Q (preferably gener- 
ated by a user 1901), the search engine 1902 must transform 
that query histogram Q into the appropriate histogram color 
spaces of the particular archives 1903, 1904, 1905 (i.e., 
F1(Q), F2(Q), and F3(Q), respectively). Content-based 
searching across multiple archives requires transformations
of the query histogram Q to be compatible with the specific
content descriptions in each archive. 
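
The query flow described above may be sketched as follows, under stated assumptions. The archive names, the stored histograms, and the two transforms (`identity` and `merge_pairs`) are hypothetical stand-ins for F1(Q), F2(Q), etc.; the specification does not define the transforms themselves.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def identity(q):
    """Hypothetical transform for an archive that shares the query's histogram space."""
    return list(q)


def merge_pairs(q):
    """Hypothetical transform: fold adjacent bins together (e.g., 4 bins -> 2 bins)."""
    return [q[i] + q[i + 1] for i in range(0, len(q), 2)]


# Each archive is paired with the transform that maps a query histogram
# into that archive's own histogram space (illustrative data only).
ARCHIVES = {
    "archive_1903": (identity, {"img_a": [4, 0, 0, 0], "img_b": [1, 1, 1, 1]}),
    "archive_1904": (merge_pairs, {"img_c": [4, 0], "img_d": [0, 4]}),
}


def search(query, archives):
    """Transform the query once per archive, score every item, and rank by distance."""
    results = []
    for name, (transform, items) in archives.items():
        transformed = transform(query)
        for item_id, hist in items.items():
            results.append((euclidean(transformed, hist), name, item_id))
    return sorted(results)


ranked = search([3, 1, 0, 0], ARCHIVES)
```

In this sketch the search engine never compares the raw query against an archive; it always compares the transformed query, so each archive may keep its own description format.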
Extensible Markup Language (XML) Representation 
With reference again to the MPEG-7 example, it is preferable that a representation for MPEG-7 data, or the InfoPyramid data abstractions, be at least readable, portable, standard and extensible. Accordingly, XML is preferable for use with the present invention as the basis of the representation language (InfoPyramid Description Language or IPDL) for InfoPyramids, although the present invention contemplates using any suitable equivalent language. As appreciated by those skilled in the art, XML is a tagged markup language for representing hierarchical, structured data. XML is, by definition, extensible and has the advantages of being portable (it is essentially independent of the underlying machine, operating system/platform, programming language, etc.) and is easily readable by both machines and humans.

In addition to the above features, XML has a strong linking model. This feature of XML is useful for specifying and maintaining relationships between different modalities and versions, etc., of content. This linking mechanism also makes the representation independent of the underlying storage medium. For example, videos may reside on a video server, text transcript may reside in flat files for a text index and metadata may reside in a relational database. The linking mechanism of XML will make such storage transparent.

Descriptor Extensibility 
New descriptors can be defined in XML by specifying the base class types and compare methods. As an example, consider the following specification of a new color histogram description class:

<IPMCD classname="myhist" baseclass="histogram(real)[64]" compare="Euclidean" owner="address" spec="address"> </IPMCD>,

which defines the descriptor class "myhist", which corresponds to a 64-bin histogram that utilizes the Euclidean distance metric to compare myhist descriptions. The myhist content description instances may be specified as:

<IPMCD id=999999 myhist="832034 11242342342 . . ."> </IPMCD>
Descriptor Schemes 
The multimedia content description language preferably enables the development of descriptor schemes in which a set of content descriptors and their relationships are specified. For illustration purposes only, consider the following example:

<IPMCD classname="colorregion" baseclass="myhist, shape"
compare="0.6*myhist.Euclidean+0.4*shape.walk">
</IPMCD>

<IPMCD classname="regionset" baseclass="set(colorregion)[N]"
compare="sum(n=0;N-1)(colorregion.compare)">
</IPMCD>
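
The two compare expressions above may be read as ordinary arithmetic over the component distances. The following is a minimal sketch of that reading. Several points are assumptions made only for illustration: a shape is represented as a plain list of numbers, "walk" is taken to be the city-block distance, and two region sets are compared by pairing their regions in order.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def walk(x, y):
    # Assumed reading of "walk": city-block (L1) distance.
    return sum(abs(a - b) for a, b in zip(x, y))


def compare_colorregion(r1, r2):
    """compare = 0.6*myhist.Euclidean + 0.4*shape.walk"""
    return (0.6 * euclidean(r1["myhist"], r2["myhist"])
            + 0.4 * walk(r1["shape"], r2["shape"]))


def compare_regionset(s1, s2):
    """compare = sum over n = 0 .. N-1 of colorregion.compare (regions paired by index)."""
    return sum(compare_colorregion(a, b) for a, b in zip(s1, s2))


region_a = {"myhist": [0.0, 0.0], "shape": [0.0, 0.0]}
region_b = {"myhist": [3.0, 4.0], "shape": [1.0, 2.0]}
score = compare_colorregion(region_a, region_b)  # 0.6 * 5 + 0.4 * 3
```

The weights 0.6 and 0.4 come directly from the example scheme; a different scheme would simply declare different weights or component metrics.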

The domain of MPEG-7 descriptors is very large. An investigation of early proposals for MPEG-7 shows that a large number of features and metadata have already been suggested, and this list is only going to increase. Most of these are specific to particular media objects or application domains. XML includes an excellent mechanism, the Document Type Definition (DTD), which makes it possible to manage the plethora of metadata and feature descriptors by DTDs which support the subset for a particular media or application. The DTDs also make it easy for a particular community (say, Satellite Imagery vs. News videos) to share and conform to a specific set of MPEG-7 descriptors by subscribing to a common set of DTDs.
Inter Object Specification (IOS)

FIG. 10 illustrates an Inter Object Description Scheme (IODS), formed in accordance with a preferred form of the present invention. With reference to FIG. 10, the IODS 1001 preferably accommodates a number of description schemes that describe the relationships among objects at various levels of abstraction. Furthermore, the IODS provides a mechanism for describing object compositions, starting from the terminal object (i.e., the InfoPyramid 1005), and provides descriptions for inter-object relationships, including, for example, temporal 1002, spatial 1003, spatio-temporal 1004, hyper-linking (not shown) and others known in the art. The objects referred to in an IODS can be elementary objects or composite objects. Traditional media objects and InfoPyramid objects are preferably treated as elementary objects (described herein below). An important characteristic of an IODS is its ability to handle time and space flexibly.

Representing Temporal Relationships 

Temporal relationships among objects are preferably rep- 
resented in the IOS, in accordance with the present 
invention, by a set of temporal elements, such as, for 
example: 

meet, 

co-begin, 

co-end,

co-occur, 

with each set of temporal elements preferably describing the corresponding relationship among related objects. The relationship "meet", for example, may be used to sequence objects, "co-begin" may be used to align start times, "co-end" may be used to align end times, "co-occur" may be used to align both start and end times, etc. For example, the relation meet(a, b) preferably describes a sub-scene where object a is immediately followed by object b; co-begin(a, b) may describe a sub-scene where objects a and b start together; co-end(a, b) may describe objects a and b ending together; and co-occur(a, b) may describe objects a and b both starting and ending together.
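
The four temporal elements above may be illustrated as predicates over objects that each carry a start time and an end time. This is a minimal sketch only: the `Interval` type and the exact-equality tests are assumptions made for illustration and are not part of the description scheme itself.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for an object's presentation interval."""
    start: float
    end: float


def meet(a, b):
    """Object a is immediately followed by object b."""
    return a.end == b.start


def co_begin(a, b):
    """Objects a and b start together."""
    return a.start == b.start


def co_end(a, b):
    """Objects a and b end together."""
    return a.end == b.end


def co_occur(a, b):
    """Objects a and b both start and end together."""
    return co_begin(a, b) and co_end(a, b)


title = Interval(0.0, 5.0)
scene = Interval(5.0, 20.0)
music = Interval(5.0, 20.0)
```

Here `meet(title, scene)` holds because the title ends exactly when the scene starts, and `co_occur(scene, music)` holds because the two share both endpoints.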

Additional temporal constraints between pairwise objects may include, but are not limited to:

followed by, where the starting time of one object is larger than the ending time of the second object;
preceded by, where the ending time of one object is smaller than the starting time of the second object;
immediately followed by, where the starting time of one object is substantially equal to the ending time of the second object;
immediately preceded by, where the ending time of one object is substantially equal to the starting time of the second object;
start after, where the starting time of one object is after the starting time of a second object;
end before, where the ending time of one object is before the ending time of a second object;
overlap, where the duration of two objects overlap; and
contain, where the duration of one object contains the other object.
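
As a hedged illustration, the pairwise constraints listed above can be evaluated together for two objects given as (start, end) pairs. The function name `relations` and the `tol` parameter, which stands in for "substantially equal", are hypothetical; the specification does not fix a tolerance.

```python
def relations(a, b, tol=0.0):
    """Return the names of the listed constraints that object a satisfies relative to b.

    a and b are (start, end) tuples. tol is an assumed tolerance used for the
    "substantially equal" tests.
    """
    a_start, a_end = a
    b_start, b_end = b
    checks = {
        "followed by": a_start > b_end,
        "preceded by": a_end < b_start,
        "immediately followed by": abs(a_start - b_end) <= tol,
        "immediately preceded by": abs(a_end - b_start) <= tol,
        "start after": a_start > b_start,
        "end before": a_end < b_end,
        "overlap": a_start < b_end and b_start < a_end,
        "contain": a_start <= b_start and b_end <= a_end,
    }
    return {name for name, holds in checks.items() if holds}
```

Several constraints may hold at once; for instance, an object that contains another also overlaps it.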

It should be understood that the present invention further 
contemplates the concatenation (using concatenation func- 
tions known by those having skill in the art) of one or more 
temporal relationships to describe additional, more complex 
relationships.

The temporal elements provided by the present invention are similar to the synchronization elements, namely, "seq" and "par" (sequential and parallel elements, respectively), provided by the Synchronized Multimedia Integration Language (SMIL), which allows for the creation of time-based multimedia delivery over the web. However, IOS provides a set of more general temporal relationships from which simple "seq" and "par" elements can be generated. Furthermore, unlike SMIL, IOS allows built-in flexibility in the object's duration. The flexibility in IOS stems primarily from its spring-like object model, where each object, either elementary or composite, is preferably treated as if it were a spring. This preferred representation of the present invention is described in more detail herein below.

Representing Objects as Springs

In the IOS scheme of the present invention, each object preferably has associated with it a triple of minimum, optimal, and maximum lengths, specified by the author or by the system. As such, an object's duration is preferably specified as a range bounded by a minimum and a maximum, and including the object's optimal length. For example, a certain video clip has a certain duration when played at the speed it was captured at (e.g., 30 frames per second). The multimedia content description framework (MMCDF) of the present invention preferably allows authors to define a range in the playback speed, for example, between 15 frames per second (slow motion by a factor of 2), and 60 frames per second (fast play by a factor of 2). For the exemplary video clip, this results in a maximum and minimum total playback duration, respectively. Note that it is still possible to dictate only one specific playback duration (which is directly related to the playback speed in the case of video, audio, or animation), by restricting the duration range to a width of zero.
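
The relationship between the playback speed range and the duration triple can be worked through in a short sketch. The `Spring` type and the `video_spring` helper are hypothetical names introduced only for this illustration; the frame rates are those of the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spring:
    """Hypothetical (minimum, optimal, maximum) duration triple, in seconds."""
    minimum: float
    optimal: float
    maximum: float


def video_spring(frame_count, capture_fps=30.0, slowest_fps=15.0, fastest_fps=60.0):
    """Derive a duration triple from a frame count and a playback speed range.

    The fastest playback gives the minimum duration, and the slowest
    playback gives the maximum duration.
    """
    return Spring(
        minimum=frame_count / fastest_fps,
        optimal=frame_count / capture_fps,
        maximum=frame_count / slowest_fps,
    )


clip = video_spring(900)                     # 900 frames captured at 30 fps
fixed = video_spring(900, 30.0, 30.0, 30.0)  # zero-width range: one fixed duration
```

For a 900-frame clip, the triple is 15, 30 and 60 seconds. Setting all three rates equal collapses the range to a single duration, which corresponds to the zero-width case noted above.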

Such spring-like objects can then be connected using the temporal relationships described herein above. This connected-spring model preferably allows built-in flexibility in the delivery system. For instance, time stamps are not required to be fixed and hardwired. Rather, time stamps may be coded, reflecting the given relationships and the spring properties of the corresponding objects. Furthermore, an acceptable range of playback times of an object can be exploited by the playback (or delivery) system to account for network delays, processor or peripherals speed, etc.
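
One way a playback system might exploit such ranges is sketched below for a simple chain of objects joined by "meet" relationships. This is an illustrative policy only, not a method taken from the present specification: the proportional distribution of stretch or compression, and the function name `schedule_sequence`, are assumptions.

```python
def schedule_sequence(springs, target):
    """Fit a chain of objects joined by "meet" into a target total duration.

    springs is a list of (minimum, optimal, maximum) durations. Returns a list
    of (start, end) time stamps, or None if the target cannot be met.
    Stretch or compression is shared in proportion to each object's slack,
    which is an assumed policy.
    """
    lowest = sum(s[0] for s in springs)
    highest = sum(s[2] for s in springs)
    if not lowest <= target <= highest:
        return None
    optimal = sum(s[1] for s in springs)
    if target >= optimal:
        room = highest - optimal
        frac = (target - optimal) / room if room else 0.0
        durations = [opt + frac * (mx - opt) for _, opt, mx in springs]
    else:
        room = optimal - lowest
        frac = (optimal - target) / room if room else 0.0
        durations = [opt - frac * (opt - mn) for mn, opt, _ in springs]
    schedule, clock = [], 0.0
    for duration in durations:
        schedule.append((clock, clock + duration))
        clock += duration
    return schedule


video = (15.0, 30.0, 60.0)   # flexible clip
caption = (10.0, 10.0, 10.0)  # rigid object with a zero-width range
plan = schedule_sequence([video, caption], target=55.0)
```

In this example the rigid caption keeps its 10 second duration, and the flexible clip absorbs all of the stretch, so the time stamps are computed from the relationships and spring properties rather than being fixed in advance.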

This flexible time stamping mechanism of the present invention has been proposed for MPEG-4 to optionally provide the MPEG-4 delivery system with a mechanism for achieving adaptive temporal synchronization. The motivation for having such an extension is that in environments with unreliable delivery, the presentation of multimedia objects may falter due to the missing of time stamp deadlines. The new, more flexible timing information will have at least two features. First, instead of fixed start and end times, the presentation duration of an object (an elementary stream) may be given a range. Second, the start and end times are preferably made relative to the start and end times of other multimedia objects. This information can then be used by the client to adapt the timing of the ongoing presentation to the environment, while having the flexibility to stay within the presentation author's expectations.
Representing Spatial Relationships 

In a similar manner to the representation of temporal 
relationships, spatial properties can also be specified using 
relationships, and acceptable ranges, in accordance with a 

65 preferred embodiment of the present invention. For instance, 
a two-dimensional (2D) rectangular object may have asso- 
ciated with it a triple of minimum, optimal, and maximum 
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area requirements, along with minimum, optimal and maxi- in many applications, such as medical, astrophysical and 

mum aspect ratios. remote sensing scenarios. 

Possible spatial relationships among multiple objects can Solving Temporal/spatial Constraints 

also be specified using constructs such as: Both temporal and spatial constraints can be resolved 

left-align; 5 using a constraint solver, as appreciated by those skilled in 

ruml-alum: 100 art ' an( ^ temporal/spatial schedules can be computed 

automatically if there is a solution that satisfies all the 

top-align, and constraints. Suitable techniques for solving temporal and 

bottom-align. spatial constraints are well known in the art. Accordingly, a 

In addition, spatial relationships between pairwise objects 10 detailed discussion of such will not be presented herein, 

may include: Stream Description Scheme (SDS) 

top-of; A stream description, as defined by the present invention, 

bottom-of; is preferably a mapping from an elementary (or terminal) 

adjaceni/neighooring- ob j ecl or a ^P 05 ^ ( or nonterminal) object to a serial 

15 logical bit stream. Since this description is logical, its 
near/close by, mapping to the specific protocol is undetermined. It is to be 
within/contained; appreciated that the bit stream may be transmitted via any 
north of; suitable medium, including dedicated data lines or wireless 
south of; communication channels, such as cellular, satellite, 
cast of; and 20 microwave, or electromotive force (EMF) networks. Pos- 
wcst Q £ siblc mediums/protocols which may be employed for car- 
Note, that spatial operators may refer to both 2D objects J*** the steam may include^ but are not limited to hyper- 
(such as images) and ZD+time objects (such as video, ^transport protocol (HTTPX Internet protocol (IP) TCP, 
presentation, etc.). Consequently, these spatial relationships Flber . ChanneI 112 chss 1 ° r 2 ' and 
will have different implications depending on the context of 25 ^ digital television (DTV) transmissions, 
their usage. For 2D objects, the spatial constraints may be u the data description language is not described 
evaluated unambiguously. For 2D+time objects, the spatial herein, many of the description languages that are suitable 
constraints may have to be evaluated in conjunction with the for ™ Wllh * e P resent u,ve ^ on (•*■ ™ L > &h * ad y 
temporal constraints. Accordingly, the spatial relationships P r0Vld 5 a senalizjUoD mapping. *^™nple, the senahza- 
. ■ j -.j -i .u- 30 Uon of an XML descnption over HTTP is well defined, as 
being described may imply, among other things: . , F , " , . . . " , 

T . • i * appreciated by those skilled in the art. In general, the 

the spatial relationships among/between objects during a ^ q{ {hc scrialization may mcludc ^ inleM bjcct 

specinc tune mstant (i.e., snapsno descriptions and InfoPyramid descriptions. As an example, 

the spatial relationships among/between objects during an
arbitrary point in time, . . .

the spatial relationships among/between objects for an
entire duration; and/or

the spatial relationships among/between objects during a
representative moment, given by a specification of the
InfoPyramid or objects.

Furthermore, additional possibilities/relationships that are
suitable for use with the present invention may exist such
that time is the primary dimension, and as such, temporal
constraints may be used to derive additional spatial
constraints among objects that are related temporally. The
interaction of time/space is known in the art and, therefore,
a detailed discussion of the subject matter will not be
presented herein.

As an example of a spatial relationship, FIG. 5 shows that
object A 501 is to the northwest of object B 502, which is to
the east of object C 503. Furthermore, object B is within
object D 504. Object D is also to the southeast of object A
501 and to the east of object C 503. Ultimately, a complex
graph can be potentially represented unambiguously by a set
of pairwise relationships. As an example of a temporal
relationship, FIG. 6 shows that object A 601 starts after
object B 602, and ends long after object B. Object A starts
before object C 603, and ends after object C. Object C starts
after object D 604 and ends after object D.

An example of the life cycle/duration of objects is
illustrated in FIG. 7. Referring to FIG. 7, two objects, namely,
object A 701 and object B 702, are shown merged at some
time into object C 703. Object C 703 may then be
subsequently expanded to object D 704. At that point in time,
object D 704 is subsequently split (or forked) into three
objects, namely, object E 705, object F 706 and object G
707. This object split and/or merge phenomenon may arise

consider the following:

stream description S1 for inter-object description Oc
  Inter-object description Oc for object O12 and object O23
    Inter-object description O12 for object O1 and O2
      InfoPyramid description for object O1
      InfoPyramid description for object O2
    Inter-object description O23 for object O2 and O3
      InfoPyramid description for object O2
      InfoPyramid description for object O3

Due to the scoping rule, the description of object O2 is
repeated inside the inter-object description O23. This may
be desirable for minimizing the memory requirement for
performing content filtering or synthesizing final content.
Alternatively, it is also feasible to serialize the stream as:

stream description S1 for inter-object description Oc
  Inter-object description Oc for object O12 and object O23
    Inter-object description O12 for object O1 and O2
    Inter-object description O23 for object O2 and O3
  InfoPyramid description for object O1
  InfoPyramid description for object O2
  InfoPyramid description for object O3

It is to be appreciated that the current stream description
scheme would permit either serialization approach, since
both descriptions are consistent with the scoping rules.

Aggregation Description Scheme (ADS)

An aggregation may be defined as the union of a collection
of terminal or nonterminal objects and the access
methods. Components in an aggregation description
scheme, in accordance with the present invention, preferably
include a description of:
Grand schema: This is the catalog of all of the data and
services provided by the aggregation;

Data description: This description preferably includes all
of the inter-object specification (IOS) as well as the
InfoPyramid intra-object specification. This corresponds
to the data catalog in the traditional sense, and
enables an understanding of what is contained in the
aggregation.

Service description: This describes the services provided
by the aggregation, including search and retrieval of
data through the specification of parametric data, or the
search and retrieval of data through similarity/fuzzy
retrieval using features or semantics, or a combination
of both.
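
As a minimal sketch of how the three components listed above might be held together in software, the following Python fragment models a grand schema as a catalog that wraps one data description and a list of service descriptions. The class names, field names, and sample values are illustrative assumptions and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataDescription:
    # Inter-object specification: (object, relation, object) triples.
    inter_object: List[Tuple[str, str, str]]
    # InfoPyramid intra-object specification: object id -> modalities.
    intra_object: Dict[str, List[str]]


@dataclass
class ServiceDescription:
    name: str
    kind: str  # assumed labels: "parametric", "similarity", or "combined"


@dataclass
class GrandSchema:
    """Catalog of all data and services offered by one aggregation."""
    data: DataDescription
    services: List[ServiceDescription] = field(default_factory=list)

    def catalog(self) -> Dict[str, List[str]]:
        # Flatten the descriptions into a human-readable catalog.
        return {
            "objects": sorted(self.data.intra_object),
            "relationships": [
                f"{a} {rel} {b}" for a, rel, b in self.data.inter_object
            ],
            "services": [f"{s.name} ({s.kind})" for s in self.services],
        }


# Hypothetical aggregation with two objects and two services.
example = GrandSchema(
    data=DataDescription(
        inter_object=[("O1", "before", "O2")],
        intra_object={"O2": ["image"], "O1": ["video", "text"]},
    ),
    services=[
        ServiceDescription("search_by_date", "parametric"),
        ServiceDescription("search_by_color", "similarity"),
    ],
)
```

Calling `example.catalog()` returns the object identifiers, the pairwise relationships, and the service names in one dictionary, which is one way a client could discover what the aggregation contains before issuing a query.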

Based on the description schemes of the present invention
discussed herein, multimedia content can be either analyzed
or synthesized according to these schemes. After the
analysis/synthesis step, which generally comprises an
assessment of the target/intended audience and associated
target multimedia devices, the source multimedia content
can then be stored using the MMCDF framework of the
present invention. This stored content can subsequently be
used to provide multimedia content to various devices with
different platforms (as in FIG. 2).

In accordance with a preferred embodiment of the present
invention, a method is provided for analyzing source
multimedia content, as shown in FIG. 17. With reference to FIG.
17, this method preferably comprises the following steps:

1. Analyze the audience composition 1701: Determine the
audience of the multimedia content. This analysis
preferably includes the distribution of the user interests (e.g.,
how many users are interested in video, how many users
like to hear the audio, etc.), the distribution of the
platform (e.g., WinCE, Palm OS, Java, etc.), devices
employed (e.g., WinCE devices, Palm, SmartPhone,
WebTV, watchpad, wearable computer, general purpose
PC/workstation, etc.), network connection (e.g., wireless,
phone line, ADSL, cable modem, local area network,
etc.), and connection bandwidth (e.g., from 9600 bps to
1.0625 Gbps).

2. Select Modalities 1702: Based on the distribution
analysis, select the target modalities for the given
multimedia content. This step preferably includes the
generation of a union of the necessary modalities from all users
and all supported devices.

3. Select Fidelities 1703: Based on the distribution analysis,
select the target fidelities. This step preferably includes
the clustering of the range of bandwidth, device
resolution, etc.

4. Generate modality-fidelity dependency graph 1704: This
step preferably includes generating descriptions (similar
to the examples of FIGS. 11-14) for each of the terminal
nodes of the source multimedia content.

5. Analyze content 1705: Decompose the multimedia
content source according to the description scheme (as
illustrated by the examples of FIGS. 7, 8, and 10) to extract
InfoPyramid representations of each individual media
modality, the intra- and inter-object relationships.

6. Materialize modality and fidelity transformations 1706:
Based on the usage statistics, those modalities in the
modality-fidelity dependency graph are preferably
materialized by invoking the appropriate modality translation
and fidelity transformation operators.

7. Generate annotations 1707: Generate necessary
annotations of each object, including the type, purpose,
intention of use, etc. that may be extracted from the original
multimedia content.

8. Repeat steps 5-7 if the current object is not a terminal
object 1708 until all the multimedia content has been
analyzed.

In accordance with a preferred embodiment of the present
invention, a method is provided for synthesizing a
multimedia content source, as shown in FIG. 18. With reference
now to FIG. 18, this method preferably comprises the
following steps:

1. Analyze the audience composition 1801: Determine the
target audience of the multimedia content. This analysis
preferably includes the distribution of the user interests
(e.g., how many users are interested in video, and how
many users like to hear the audio, etc.), the distribution of
the platform (e.g., WinCE, Palm OS, Java), devices
employed (e.g., WinCE devices, Palm, SmartPhone,
WebTV, watchpad, wearable computer, general purpose
PC/workstation, etc.), network connection (e.g., wireless,
phone line, ADSL, cable modem, local area network,
etc.), and connection bandwidth (e.g., from 9600 bps to
1.0625 Gbps).

2. Select Modalities 1802: Based on the distribution
analysis, select target modalities for the given multimedia
content. This step preferably includes the generation of a
union of the necessary modalities from all users and all
supported devices.

3. Select Fidelities 1803: Based on the distribution analysis,
select target fidelities. This step preferably includes the
clustering of the range of bandwidth, device resolution,
etc.

4. Generate modality-fidelity dependency graph 1804: This
step preferably includes generating descriptions (similar
to FIGS. 11-14) for each of the terminal nodes of the
multimedia content.

5. Synthesize multimedia content 1805: Combine the
multimedia content source according to the description
scheme (as illustrated in FIGS. 7, 8, and 10), including the
addition of the intra- and inter-object relationships.

6. Materialize modality and fidelity transformations 1806:
Based on the usage statistics, those modalities in the
modality-fidelity dependency graph are preferably
materialized by invoking the appropriate modality translation
and fidelity transformation operators.

7. Generate annotations 1807: Generate necessary
annotations of each object, including the type, purpose, intention
of use, priority of presentation, etc. that may be extracted
from the original content.

8. Repeat steps 5-7 if the current object 1808 is not a
terminal object until all the content has been analyzed.

In order to more clearly illustrate the possible applications
of the system and methods of the present invention, several
examples are provided herein below. These examples,
however, are not intended to limit the scope of the invention.

EXAMPLE 1

Web Image Search Engine

In a Web-based search environment, the multimedia
content description system of the present invention may be
utilized in the development of a Web image search engine.
The objective of the Web image search engine is to catalog
images and video information on the World Wide Web and
allow users to search the catalog. The Web image search
engine preferably uses content descriptors to index the
images and video information by visual features, text, and
semantics concepts.

In accordance with the present invention, the Web image
search engine preferably employs a set of descriptors which
are automatically and/or semi-automatically generated. The
visual features of the images may be defined by a color
histogram and a texture vector which are preferably
automatically computed. The system preferably assigns each
image and video a set of terms which are automatically
extracted from the parent Web document and Web address.
Furthermore, the Web image search engine preferably
assigns various concept labels to each image and video by
looking up the assigned terms in a term-concept dictionary.
This process is semi-automatic in the sense that the concept
labels may be later verified manually. Each concept class
belongs to a concept ontology that is also developed
manually.

The content descriptions in the Web image search engine
are represented using the InfoPyramid multimedia content
description language. The content description types are
defined as follows:

<IPMCD classname="color" baseclass="histogram(real)[166]" compare="Euclidean" owner="webimsearch" spec="address"> </IPMCD>,

<IPMCD classname="texture" baseclass="histogram(real)[9]" compare="Euclidean" owner="webimsearch" spec="address"> </IPMCD>,

<IPMCD classname="text" baseclass="set(term)" compare="String" owner="webimsearch" spec="address"> </IPMCD>,

<IPMCD classname="concepts" baseclass="set(concept)" compare="String" owner="webimsearch" spec="address"> </IPMCD>

The Web image search engine specifies the content
description instances as follows:

<IPMCD id=999999 color=<2034 11 242342342 . . .
texture=<284 . . . text=<term1/term2/term3/ . . .
concepts=<concept1/concept2/concept3/ . . . >
</IPMCD>

In this way, any search engine may search the catalog of
image and video content descriptions.

EXAMPLE 2

Satellite Image Retrieval System

In a preferred content-based retrieval system of satellite
images, image content is represented as an InfoPyramid with
four modalities: (1) Pixel, or the original image (2) Feature
(3) Semantic and (4) Metadata.

The present invention distinguishes between simple and
composite objects. A simple object can be defined as a
region of an image that is homogeneous with respect to an
appropriate descriptive quantity or attribute. A composite
object includes multiple simple objects with pairwise spatial
(e.g., adjacent, next to, west of), temporal (e.g., before, after)
relationships. A simple object can be defined at any of the
modalities.

This system can answer queries such as "find all the
regions of cauliflower fields that have clubroot disease."
Here, the search target is specified by a composite object
containing cauliflower field regions and clubroot disease
regions.

EXAMPLE 3

Internet

In one application, the InfoPyramid is used to allow
content providers to represent Internet content in a form that
allows its customized delivery according to client device
characteristics and user preferences. The InfoPyramid may
also be used as a transient structure that facilitates the
transcoding of Internet content on-the-fly to customize the
retrieval and display of Internet content.

While content negotiation is not a problem addressed by
MPEG-7, it is desirable if the MPEG-7 representation also
supported content negotiation mechanisms. Otherwise, two
representations would have to be used: (1) MPEG-7 for
search and (2) another for retrieval. However, the content
negotiation framework is needed to satisfy the query.
Depending on the query, access is provided to different
components of the content. A query may need to
examine only the textual transcript, as exposing the video
representation to the query would be meaningless and a
waste of network/computational resources.

EXAMPLE 4

TV News Application

A TV News application may be used to illustrate the
concepts of InfoPyramid and IPDL. As this is a type of
application that MPEG-7 would support, it gives an example
representation for MPEG-7.

This application automatically captures and indexes
television news stories and makes them available for search
over the Internet. The system captures news video and the
closed caption stream, time stamps and stores them. The
closed caption is not aligned to the video due to the live nature
of news broadcasts. The system uses visual and audio cues
to align the closed caption to the video. It then segments the
news program into individual news stories. The text
transcript of each story, contained in the closed caption, is
fed to a text indexer. A user then queries this [illegible] of
news stories over the Internet using text queries. First, as in
other query systems such as AltaVista, the system presents
a summary of the news stories matching the query. The user
can then select a story to see more details.

The video component, which can also be viewed as an
InfoPyramid, has different fidelity levels corresponding to
different representations. The basic level may be video in
AVI [illegible]. Due to its high data rates, it is suitable only
for [illegible]. This video is further compressed in
Bamba, which uses H.263 video. Bamba video
can be streamed over the Internet to computers connected
with modems operating at 28.8 kbs or higher. For further
resolution reduction, the next layer may be a set of
representative frames or key-frames, which provide a further
data-reduced representation of the video. Note that here the
resolution reduction leads to a modality transformation from
video to images. These can be used when a
summary representation is required or one of the video
representations could not be served due to network
bandwidth or client platform capabilities. These images may be
statically displayed or synchronized with the audio to
provide a slide-show. The text component is obtained from the
synchronized closed captions. This text is represented in
reduced resolutions by summaries, title and news categories
or key words; all obtained by automatic tools or manual
annotation. The audio is maintained in two levels, as a wave
file (associated with the AVI video) and an audio Bamba
stream (associated with the Bamba video stream). The news
story itself can be seen as represented at different resolutions
from full AVI video through key-frames with text, to audio
alone down to the level of just a text title for the story.
This InfoPyramid can be represented in IPDL using XML
as:

<NEWS-STORY>
<Station>ABC</Station>
<date>11/2/97</date>
<time>5:00pm</time>
<program>Evening News</program>
<video>http://video1.ipdl</video>
<transcript>http://text1.ipdl</transcript>
</NEWS-STORY>

These XML based representations of the news story
InfoPyramid are easily readable and comprehensible. They are
also easily parsed by machines. The linking mechanism
makes explicit the interrelationships between the various
components of a news story and makes it storage
independent.
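
To illustrate the claim that such a representation is easily parsed by machines, the following Python sketch reads a news story description with the standard `xml.etree.ElementTree` module and separates the linked components from the plain metadata. The element names are taken from the listing above as best they can be read; the function names `parse_news_story` and `component_links` are hypothetical, and IPDL itself may define further rules that this sketch does not cover.

```python
import xml.etree.ElementTree as ET

# News story description, with element names as read from the listing.
NEWS_STORY_XML = """<NEWS-STORY>
  <Station>ABC</Station>
  <date>11/2/97</date>
  <time>5:00pm</time>
  <program>Evening News</program>
  <video>http://video1.ipdl</video>
  <transcript>http://text1.ipdl</transcript>
</NEWS-STORY>"""


def parse_news_story(xml_text):
    """Return a dict mapping each child element name to its text."""
    root = ET.fromstring(xml_text)
    if root.tag != "NEWS-STORY":
        raise ValueError(f"unexpected root element: {root.tag}")
    return {child.tag: (child.text or "").strip() for child in root}


def component_links(story):
    """Keep only the entries that link to a separately stored component."""
    return {
        name: value
        for name, value in story.items()
        if value.startswith("http://")
    }


story = parse_news_story(NEWS_STORY_XML)
links = component_links(story)
```

Here `links` holds only the `video` and `transcript` entries. Because the story refers to its components by address rather than embedding them, the description stays the same regardless of where the video and transcript are physically stored.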

EXAMPLE 5 

Query Retrieval Relationship 

As we have seen, content in general is multi-modal. For
example, the news story has many different video streams,
audio streams, key-frames, textual transcript, key-words etc.
This means that, depending on a query, the right modality
has to be exposed to the search mechanism. For example,
text based queries require access to the textual transcript,
while a visual search may make use of the key-frames or the
video streams.
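
The idea of exposing only the right modality to the search mechanism can be sketched as a simple lookup. In the Python fragment below, the query types, the mapping from query type to modality, and the file names are all assumptions made for illustration; the specification does not prescribe a particular data structure.

```python
from typing import Dict, List

# Hypothetical news story content, keyed by modality.
NEWS_STORY: Dict[str, List[str]] = {
    "video": ["story.avi", "story.bamba"],
    "audio": ["story.wav"],
    "key-frames": ["frame1.jpg", "frame2.jpg"],
    "transcript": ["story.txt"],
    "key-words": ["election", "weather"],
}

# Assumed mapping from query type to the modalities a search may inspect.
MODALITIES_FOR_QUERY: Dict[str, List[str]] = {
    "text": ["transcript", "key-words"],
    "visual": ["key-frames", "video"],
}


def expose_for_query(query_type: str,
                     content: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return only the modalities of `content` relevant to the query type."""
    try:
        wanted = MODALITIES_FOR_QUERY[query_type]
    except KeyError:
        raise ValueError(f"unsupported query type: {query_type!r}") from None
    # Modalities the content does not have are silently skipped.
    return {m: content[m] for m in wanted if m in content}


text_view = expose_for_query("text", NEWS_STORY)
visual_view = expose_for_query("visual", NEWS_STORY)
```

With this arrangement a text query never touches the video streams, which avoids the wasted network and computational resources that the text warns about.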

Our contention is that query and retrieval are interlinked.
A response to a query is the content matching the query
being returned. Just as MPEG7 does not specifically address
the search mechanism, but the MPEG7 representations have
to support search; they also will have to support access even
though MPEG7 does not address retrieval. It will be good if
the same representation supports both search and retrieval.

Just as in search, different components may have to be
exposed to meet the search; different components may have
to be returned when an access is made. For example, when
the matching stories are returned as summaries for the
results, the news story InfoPyramid has to return a summary
representation (for example, containing the key frames of
the video and a summary of the news). It would be inefficient
for the news story InfoPyramid to return the news videos, as
these may overwhelm the network and also the video
representation makes it difficult to browse through a list of
news videos. This mechanism for determining the best
format of the content to return for satisfying a request is
called content negotiation. In the TV News video
application, the content negotiation decides which
representation of the news story to deliver based on the context:
summary, full form, and client with limited bandwidth or
limited display capabilities.

Those of ordinary skill in the art will recognize that the
present invention has wide commercial applicability to the
exchange of multimedia content in general. Although
illustrative embodiments of the present invention have been
described herein with reference to the accompanying
drawings, it is to be understood that the invention is not
limited to those precise embodiments, and that various other
changes and modifications may be effected therein by one
skilled in the art without departing from the scope or spirit
of the invention.

What is claimed is:

1. A method for describing a multimedia content source
including at least one terminal object, said at least one
terminal object including one or more modalities, each of
said one or more modalities having one or more fidelities
associated therewith, said method comprising the steps of:

generating a distribution of one or more modalities and
fidelities, said distribution corresponding to an
audience of said multimedia content source;

grouping said multimedia content source into one or more
target modalities and target fidelities according to said
distribution;

generating a modality-fidelity dependency representation
for a terminal object in said multimedia content source,
said dependency representation including a description
scheme comprising predetermined transformation rules
for describing at least one of a relationship between two
modalities and a relationship between two fidelities;

decomposing the multimedia content source according to
said description scheme to create an InfoPyramid
representation of each modality;

transforming said multimedia content source according to
said modality-fidelity transformation rules;

generating annotations for each object in said multimedia
content source; and

repeating said decomposing step, said transforming step
and said step of generating annotations until every
terminal object in said multimedia content source has
been processed.

2. A method for creating a multimedia content source
including at least one terminal object, said at least one
terminal object including one or more modalities, each of
said one or more modalities having one or more fidelities
associated therewith, said method comprising the steps of:




generating a distribution of one or more modalities and 
fidelities, said distribution corresponding to an audi- 
ence of said multimedia content source; 

selecting one or more source modalities and associated 
source fidelities, said source modalities and source 
fidelities being selected according to a union of said 
distribution; 

generating a modality-fidelity dependency representation 
for a terminal object in said multimedia content source, 
said dependency representation including a description 
scheme comprising predetermined transformation rules 
for describing at least one of a relationship between two 
modalities and a relationship between two fidelities; 

synthesizing said multimedia content source according to 
the description scheme and including predetermined 
intra-object and inter-object relationships; 

transforming said multimedia content source according to 
said modality-fidelity transformation rules; 

generating annotations for each object in said multimedia
content source; and

repeating said synthesizing step, said transforming step 
and said step of generating annotations until every 
terminal object in said multimedia content source has 
been processed. 
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